Abstract Clustering is considered one of the important
data mining techniques, and document clustering is among
its many applications. Traditional clustering algorithms
have proven inefficient for clustering large, rapidly
generated real-world datasets. As a solution, traditional
clustering algorithms are modified using a distributed
programming paradigm. MapReduce is a popular distributed
programming paradigm designed for the Hadoop distributed
framework. This paper demonstrates a MapReduce based
modification of the K-Means clustering algorithm for
document datasets. The results show that the proposed
algorithm is more efficient than traditional K-Means for all
sizes of document datasets tested. The experiments also
show that MapReduce clustering works more efficiently
when the dataset size and the Hadoop cluster size are large.


Keywords MapReduce - Hadoop - Parallel K-means - 
Document clustering - Distributed computing 


Introduction 


Data mining is a process to obtain useful knowledge from
raw datasets [1]. Clustering is a well-known data mining
technique which groups similar data objects from a dataset
using the similarity among the data objects [2]. Clustering
algorithms are used on text datasets for implicit grouping of
similar documents based on the occurrence of the most
similar words among the documents [3]. The document
datasets are processed to obtain a set of unique words. The
more words two text files have in common, the more similar
the text files are, and thus the stronger their claim to be
in the same group [4]. Document clustering has many
applications, such as organizing documents in a hierarchy,
finding a particular text file in a document dataset of
multiple folders and text files, and filtering text file
information, to name a few [5, 6].

K-Means is the most widely used clustering algorithm
due to its simplicity and usefulness [7]. K-Means is a
partition based clustering algorithm which categorizes the
input dataset objects into a pre-specified number of groups,
K. In K-Means, K objects are first chosen at random from
the dataset (known as centroids or cluster centres), and
then the distance of each object in the dataset from these
cluster centres is measured. The resulting distance provides
the basis of similarity between objects in the dataset. Each
object is assigned to the group of the cluster centre to
which it is most similar (least in distance). After this, the
mean value of each newly formed group is taken, and these K
mean values become the new cluster centres. The same process
is repeated until the new cluster centres are the same as
those of the last iteration or some other user-defined
criterion is satisfied. The time complexity of K-Means is
O(nkt), where n is the number of objects in the dataset, k is
the number of clusters and t is the number of iterations
required in the particular clustering process. K-Means is
widely used in the document clustering literature. As K-Means
can only work with numeric datasets, the document dataset is
preprocessed into numeric form before being input to K-Means.

Due to the digitalization of every sector and wide (and easy)
access to the internet in recent times, document datasets,
like all other datasets, are being generated in large volumes.
This makes efficient and effective clustering of these
documents difficult [8]. Recent works suggest that instead of
using traditional clustering algorithms for such large
datasets, it is useful to modify K-Means using a modern
distributed programming paradigm and execute the modified
K-Means in a distributed environment [9]. MapReduce is a
distributed programming paradigm which implements an
algorithm in two steps (using two functions named Map
and Reduce) in order to execute the algorithm on the Hadoop
distributed platform. In Hadoop’s master-slave architecture,
a MapReduce coded algorithm is submitted at the master node,
and the master distributes the job among the connected slave
nodes. Each slave node executes the parts of the job (known
as tasks) assigned to it, based on its own hardware capacity
[2]. The efficiency of the Hadoop cluster depends on the
design of the algorithm in MapReduce and the total processing
capacity of the cluster nodes.

In this study, a novel MapReduce based modified K-Means
algorithm is proposed and implemented on top of the
Hadoop platform. The uniqueness of the study lies in the
extensive evaluation of the algorithm through many
experiments. Although every experiment uses the same proposed
algorithm, the experiments are designed around different
combinations of cluster size and dataset size, and the
corresponding execution time is recorded. The execution time
of the proposed K-Means is compared with that of traditional
K-Means. To support the experimental analysis, the proposed
algorithm is executed on Hadoop clusters with different
numbers of nodes for clustering document datasets of
different sizes.

The rest of the paper is organized as follows. The “Literature
review” section surveys similar and related works on K-Means
clustering. The “Methodology” section provides the details of
our document clustering study in MapReduce, focusing
especially on the proposed algorithm. The “Experimental
results” section presents the observations from the
experiments and their analysis. The last section concludes
this work.


Literature Review 


Among many of the clustering algorithms K-Means is 
undoubtedly one of the most popular and widely used 
algorithms. The main reason behind the popularity of K- 
Means is its simplicity [10]. 

K-Means starts with the prior selection of the number of
resulting clusters, K. Based on this number, K-Means randomly
selects K datapoints (known as centroids) and then calculates
the distance between these K points and all other datapoints.
Each datapoint is assigned to its nearest centroid. Then a
mean value is calculated for each centroid using the
arithmetic mean; that is, the sum of the datapoints assigned
to the centroid is divided by the number of datapoints
assigned to it. These mean values become the new centroids
for this step. The process is repeated iteratively until a
user-specified condition is met. K-Means has been used, and
found useful, in much document clustering research as well as
in other applications [11-13].

The problem with K-Means, like many other clustering
algorithms, is that it takes too much computing time when
clustering a large data repository. This large time
consumption makes data analytical operations ineffective
with respect to market and research requirements [14]. This
real-world situation requires us to modify existing
clustering algorithms such that the modified algorithm
becomes fast and effective in large data clustering. It is
also necessary to make use of modern programming frameworks
and computing platforms which are crafted for processing
large data effectively [15].

The literature provides many modifications of K-Means for
effectively clustering large real-world datasets. A few of
these modifications are technical upgrades of the
traditional sequential K-Means, and others adapt the
K-Means algorithm to different programming paradigms and
platforms. This survey briefly covers recent modifications
of K-Means of both kinds: technical modifications and
paradigm/platform related modifications.

The outputs of K-Means clustering are highly dependent
on the initial selection of centroids. There is no uniform
formalized technique for selecting the K centroids. Bradley
and Fayyad [16] feed K-Means with effective K values
pre-calculated by a preliminary run of K-Means, which
attains good clustering results. Ball and Hall [17] proposed
a technique named ISODATA which selects a probable right
value of K using a process of splitting and merging. This
work, however, requires another user-supplied measure
(number of processes) in place of K. Arthur and Vassilvitskii
[18] developed a quick seeding method for dynamically
selecting centroids. Kanungo et al. [19] use a kd-tree
structure of the data, which represents K-dimensional data
object values. This is accomplished by supplying a subset of
possible centroids from a set of data points, which is then
filtered and passed to the children nodes. The tree requires
no update from iteration to iteration and thus saves time.
Frahling et al. [20] proposed a technique which uses coresets
for choosing multiple values of the K centroids. This makes
K-Means take less time and produce good quality clustering.
K-Means is adapted by Amorim et al. [21] to eliminate noise
using the Minkowski metric; it allocates feature weights
using the Minkowski measure for all the centroids it uses.
Likas et al. [22] developed a variant of K-Means which uses
a process for each cluster.

Zhang and Forman [23] recognized that clustering a
voluminous dataset requires parallelization of algorithms
and proposed a parallel K-Means to gain efficiency. A
function is designed which takes the N data points and K
centroids to approximate a value M. The N points are spread
among a set of machines (C). Based on the value of M, K
centroids are chosen and copies are kept on all machines
(C). Then each machine in C calculates its respective part
of the statistics S. The central processor accumulates the S
values and disseminates them to all of the C machines. Using
the value of S, each machine computes the new centroids. A
weakness of this work is that the calculation time and
storage space requirements at the main (central) machine
reduce the overall efficiency.

To obtain higher efficiency, parallelization of K-Means has
been attained using the parallel paradigms OpenMP and
MPI. MPI (Message Passing Interface) delivers
parallelization using a message passing mechanism to design
efficient and scalable parallel applications. In [6],
K-Means is adapted to MPI by adding a merging algorithm. The
parallelization is attained by evenly distributing data
items to different processes and then duplicating the cluster
centers. After each iteration, a merging algorithm combines
the centroids created by the different processes into a final
set. The technique is shown to be effective for huge
datasets. This work is silent on how its effectiveness varies
with the number of processes used. Similarly in [24], the MPI
model parallelizes K-Means by distributing the algorithm's
tasks among CPUs and communicating through distributed
memory to produce the output in less execution time.

In [25], K-Means is executed in diverse computing
frameworks, namely OpenMP, MPI and CUDA-C. It is observed
that OpenMP performs best for less voluminous datasets
and CUDA performs best for large datasets. Likewise in
[26], K-Means is evaluated with CUDA, OpenMP and MPI, and
the outcome demonstrates that the execution of K-Means in
all three frameworks is quite efficient compared to
sequential K-Means, and that the ultimate performance differs
across hardware combinations. In [27], the author evaluated
K-Means with three parallel frameworks: MPI, OpenMP and
MapReduce. It is observed that OpenMP is the best choice
among the frameworks when the datasets are small, the
processor cores are many and sufficient memory is available,
whereas for modest input dataset volumes MPI is the best
choice. For large real-world datasets, however, MapReduce is
shown experimentally to be the best choice compared to the
other two. K-Means spends most of its computation time in the
iterative part, the distance calculation between centroids
and data points. In [28], the authors observe that
optimizing this iterative part is the key to obtaining
efficiency in parallel frameworks.

Li et al. [29] implemented a distributed and parallel
K-Means using MapReduce by separating the iterative
part, the distance calculation, over partial sets of the
dataset on different computing machines and finally
accumulating the result on a separate computer. The
evaluation in that work is limited, as it used just 4
computing machines and datasets with only a few dimensions.
The reasons for implementing K-Means in MapReduce on the
Hadoop distributed computing framework are listed below:


• OpenMP and MPI are only useful for small/mid-sized
datasets. MapReduce gains performance efficiently
even for large datasets [30].

• If the distance calculation (the iterative part of K-Means)
is performed on different machines in parallel on Hadoop
using MapReduce, it attains higher efficiency [30].

• The distributed architecture of Hadoop automatically
handles data distribution and accumulation, taking
this responsibility away from the programmer [31].

• A Hadoop distributed computing cluster can be a good
choice as it can be created using commodity
hardware [32].

• Hadoop provides good scalability as it can include
any number of commodity hardware machines at any
point of time [33]. Hadoop also provides a high and
variable level of fault tolerance through a user
manageable replication factor for data shared across
the computing nodes [34].


This work is an experimental K-Means clustering study which
assigns specific parts of K-Means to different
parts of MapReduce and executes them over Hadoop. The
experiments are conducted on a broader scale, using a
Hadoop cluster of up to 10 nodes and document datasets
of 10,000 dimensions and variable sizes. This experimental
setup provides an in-depth analysis of the proposed
algorithm and its results.


Methodology 


This work proposes a MapReduce based K-Means for
document clustering. The proposed algorithm is executed
on the Hadoop platform for distributed document clustering.
Before describing the details of the K-Means modifications in
MapReduce, we provide a brief overview of the MapReduce
programming paradigm.


MapReduce Programming Paradigm


The MapReduce paradigm is designed around two
functions named Map and Reduce. The Map functions
(known as mappers) take <key, value> pairs as input, and
the results obtained from the mappers (also <key,
value> pairs) are fed to the Reduce function(s) (known as
reducers).


Mapper : (k1, v1) → [(k2, v2)]
Reducer : (k2, [v2]) → [(k3, v3)]


where k1 and k2 are the in-key and out-key respectively, v1
and v2 are the in-value and out-value respectively, k3 and v3
are the final key and final value respectively, and [v2] is
the out-value list.
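
The mapper and reducer signatures above can be illustrated
with a small single-machine simulation. The sketch below is
only an illustration of the <key, value> flow and does not
use Hadoop; the names run_mapreduce, word_count_mapper and
word_count_reducer are hypothetical, and word counting is
chosen merely as a familiar example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in a single process (no Hadoop involved).

    records: iterable of (k1, v1) pairs
    mapper:  (k1, v1) -> iterable of (k2, v2)
    reducer: (k2, [v2]) -> iterable of (k3, v3)
    """
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)       # shuffle step: group values by out-key
    output = []
    for k2 in sorted(groups):           # sorted only to make the output stable
        output.extend(reducer(k2, groups[k2]))
    return output


def word_count_mapper(filename, text):
    # (k1, v1) = (file name, file content); emits (word, 1) for each word
    for word in text.split():
        yield word.lower(), 1


def word_count_reducer(word, counts):
    # (k2, [v2]) = (word, list of ones); emits (word, total)
    yield word, sum(counts)
```

For instance, calling run_mapreduce on two small records with
these two functions returns one (word, count) pair per
distinct word.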

A Hadoop cluster manages the processing part using
MapReduce and the storage management part using the
Hadoop Distributed File System (HDFS). The master node
of a Hadoop cluster splits large datasets into fixed size
data splits and transports the data splits to the slave nodes
through the connecting network. Dataset related tasks such as
making data splits, transferring the data splits to the slave
nodes, tracking data splits and initiating appropriate action
on node failure are taken care of by HDFS automatically. The
mappers execute on different slave nodes at the same time and
accomplish the task defined in the mapper function body (in
the MapReduce code submitted at the master node) for the
specific data split each slave received. The reducer collects
intermediate results from the mappers in the form of <key,
value> pairs and calculates an accumulated value for each key
obtained from multiple mappers. In this way MapReduce
produces its output using the “read many times but write
once” model.

An overview of MapReduce style execution of a data 
processing task is provided in Fig. 1. 


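Fig. 1 Data processing in MapReduce style over Hadoop platform
[Flow diagram stages: the input dataset is submitted to
Hadoop for processing; the HDFS of the Hadoop master splits
the dataset and assigns the splits to the HDFS of the slave
nodes; each data split is processed using a MapReduce mapper;
the output of the mappers is assigned to reducers at the
slave nodes (for large mapper output, a combiner at the slave
node accumulates local values for keys, which are then fed to
the reducers); the outputs from the reducers are gathered at
the master to produce the final output of data processing.]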


Proposed K-Means Based on MapReduce


This work modifies the traditional K-Means algorithm
using the MapReduce paradigm for clustering document
datasets. The steps of the well-known and widely used
traditional K-Means algorithm are as follows:


1. Randomly select k initial data objects as cluster centres
C = {c1, c2,..., ck} from the input dataset containing
data objects O = {o1, o2,..., on}.

2. For each data object o in O, find its distance to each
cluster centre in C. Assign o to the nearest cluster centre.

3. For each cluster centre in C, calculate the mean value of
the data objects assigned to it; these mean values become
the new cluster centres.

4. Repeat steps 2 and 3 until the cluster centre values
obtained in the current iteration are the same as those of
the last iteration.
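
The four steps above can be sketched in Python as follows.
This is a minimal illustration of sequential K-Means under
stated assumptions, not the implementation used in the
experiments: the function name kmeans, the seed and max_iter
parameters, and the choice to keep the old centre when a
cluster becomes empty are all assumptions made here.

```python
import math
import random


def kmeans(objects, k, max_iter=100, seed=0):
    """Sequential K-Means over a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(objects, k)                 # step 1: random initial centres
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                        # step 4: repeat steps 2 and 3
        clusters = [[] for _ in range(k)]
        for obj in objects:                          # step 2: assign to nearest centre
            nearest = min(range(k), key=lambda j: math.dist(obj, centers[j]))
            clusters[nearest].append(obj)
        new_centers = []
        for j, members in enumerate(clusters):       # step 3: mean of each cluster
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[j])       # assumption: keep old centre
        if new_centers == centers:                   # centres unchanged: stop
            break
        centers = new_centers
    return centers, clusters
```

Because the initial centres are chosen at random, different
seeds can in general lead to different final clusters, which
is the dependence on initial centroids noted in the
literature review.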


Document datasets are collections of text files.
K-Means works only on numeric datasets, so it is not
possible to cluster documents with K-Means until the text
files are preprocessed so that they can be represented by
numeric values. The document dataset is therefore transformed
using the vector space model (VSM) [35] before being input to
the proposed K-Means.

The vector space model is a widely used and accepted
model for transforming a document dataset into low-range
numeric values suitable for clustering (especially when we
use a partition and distance based algorithm like K-Means).
This model assigns a numeric value to each unique word
of a document based on the number of times it occurs in
the dataset. The numeric values for words (i.e. terms) are
generated using the term frequency-inverse document
frequency (tf*idf) weighting method. The tf*idf value
represents the importance of a word in the dataset. The tf
measures how often a word occurs in a document. A word that
occurs frequently is probably important to that document's
meaning. Document frequency is how often a word occurs in
the entire set of documents (that is, in the entire dataset).
This captures common words that appear everywhere no matter
what the topic, like ‘a’, ‘the’, ‘and’, etc. Thus a balanced
measure of the relevancy of a word to a document is term
frequency/document frequency, or term frequency * inverse
document frequency. That is, take how often the word
appears in a document, over how often it appears
everywhere. That gives a measure of how important and
unique this word is for this document. The log values of the
tf and idf are calculated. Since word frequencies are
distributed exponentially, the logs of the tf and idf give a
better weighting of a word’s overall popularity. For example,
suppose the term “clustering” occurs 1 million times in
“doc1” and 2 million times in “doc2”. Here we cannot
distinguish between the relevance of “doc1” and “doc2”,
as both contain a very high frequency of the term
“clustering”. The log can be used to lessen the importance
of a high frequency term. For the above example, when we use
log2, the frequency of 1 million is reduced to 19.9. The
tf*idf values are stored in an input file which represents
each text file and the words in the file as normalized
numeric quantities. Each text file in the document dataset,
after transformation into the vector space model, becomes one
line in the input file and is separated from the other text
files by a semicolon (;).
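
A small sketch may make the tf*idf weighting concrete. The
exact formula used in the experiments is not given in the
text, so the code below assumes one common log-scaled
variant, weight = (1 + log2(count)) * log2(N / df), where N
is the number of documents and df is the number of documents
containing the word. The function name tf_idf_vectors and
the tokenized input format are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, one weight row per document).

    Assumed weighting (not taken from the paper):
        weight = (1 + log2(count in document)) * log2(N / document frequency)
    """
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # document frequency: in how many documents each word appears
    df = Counter(word for doc in docs for word in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for word in vocab:
            if counts[word] == 0:
                row.append(0.0)
            else:
                tf = 1.0 + math.log2(counts[word])   # log-damped term frequency
                idf = math.log2(n_docs / df[word])   # zero for words in every document
                row.append(tf * idf)
        rows.append(row)
    return vocab, rows


# The damping effect quoted in the text: log2 of one million is about 19.9.
damped = math.log2(1_000_000)
```

A word that appears in every document receives weight zero
under this variant, which matches the intent of discounting
words such as ‘a’ and ‘the’.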

The K-Means algorithm is fed with this input file
through the Hadoop master node. The initial centroids are
chosen at random according to the user-specified value K.
These centroids are K lines of the input file, and the
distance from them is calculated using the Euclidean measure
for each line of the input file, as in the standard K-Means
algorithm. The MapReduce implementation modifies traditional
K-Means by implementing one specific part in the mappers and
the other part in the reducers. This selection is a critical
part of MapReduce based K-Means. The distance calculation
between centroids and text file lines (each of which is
considered one data object) is the iterative part of K-Means,
whereas the calculation of new centroids from the values
obtained in the last iteration is the sequential part. Hence,
the distance calculation is placed in the mappers and the
centroid recalculation in the reducers. The proposed K-Means
follows this methodology until the centroid values remain
unchanged from the last iteration. Figure 2 presents the
proposed K-Means workflow.


Fig. 2 The proposed K-Means workflow
[Workflow stages: input raw document dataset; input file in
vector space model; Mapper 2 ... Mapper n, each calculating
distance for its split; final centroids.]

The pseudocode for the mapper and the reducer is provided
below:


Pseudocode for Mapper


Input: Vector space transformed numeric objects O = {o1,
o2,..., on} and randomly chosen k objects as cluster centers
C = {c1, c2,..., ck}

Output: intermediate (cj, oi) pairs, 1 ≤ i ≤ n and
1 ≤ j ≤ k


1. For each object o in the input file do
2. Calculate the distance from o to each cluster center in C
using the Euclidean distance measure
3. Assign o to the cluster center in C where the distance is
minimal


Pseudocode for Reducer


Input: cluster centers C = {c1, c2,..., ck} and the
respective assigned objects O = {o1, o2,..., om, m < n}
obtained from mappers

Output: New cluster centres


1. Calculate the new centroid value for each cluster center
in C by recalculating the mean of its assigned objects
2. Store the global cluster center values in the HDFS
3. Repeat until the cluster center values obtained from
this iteration become the same as those of the last iteration


As shown above, the reducer collects the mapper output
of cluster center and data object pairs from many nodes
and then recalculates the new cluster center values
(line 1). Next, the new (modified) cluster center values are
written to a file and broadcast to all the splits (i.e. as
mapper input) on the Hadoop cluster. The process stops when
the cluster center values of two consecutive iterations
become the same.
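
The division of work between mappers and reducers described
above can be simulated in a single Python process. The sketch
below is not Hadoop code and is not the authors'
implementation: the splits are plain lists, the shuffle step
is a dictionary, and the names mapper, reducer and
kmeans_mapreduce are hypothetical. On a real cluster each
split would be processed by a mapper on a different node.

```python
import math
from collections import defaultdict


def mapper(split, centers):
    """Emit (index of nearest centre, object) for every object in one data split."""
    for obj in split:
        nearest = min(range(len(centers)), key=lambda j: math.dist(obj, centers[j]))
        yield nearest, obj


def reducer(center_index, objects):
    """Return the component-wise mean of the objects assigned to one centre."""
    n = len(objects)
    return center_index, tuple(sum(col) / n for col in zip(*objects))


def kmeans_mapreduce(splits, centers, max_iter=100):
    """Iterate map and reduce phases until the centres stop changing."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        groups = defaultdict(list)              # shuffle: group objects by centre
        for split in splits:                    # each split stands in for one node
            for j, obj in mapper(split, centers):
                groups[j].append(obj)
        new_centers = list(centers)             # centres with no objects are kept
        for j, objs in groups.items():
            _, new_centers[j] = reducer(j, objs)
        if new_centers == centers:              # same as last iteration: stop
            break
        centers = new_centers                   # "broadcast" to the next iteration
    return centers
```

The design point mirrored here is that the mapper needs only
its own split and the current centres, so distance
calculations for different splits are independent of one
another, while the reducer needs all objects assigned to a
given centre.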


Experimental Results 


The proposed algorithm is evaluated on Hadoop
clusters of variable sizes. The Hadoop cluster is made of
commodity hardware with Intel Core 2 Duo processors, 8 GB
RAM and 80 GB hard disk drives. The Hadoop cluster
nodes are connected using an Ethernet IPv4 LAN that
provides a maximum speed of 100 Mb/s. Each node is set up
with the Ubuntu 14.06 operating system, and Hadoop version
2.7.2 is used for the cluster creation. The maximum number
of Hadoop cluster nodes used is 10.

The document datasets used in the experiments are
self-generated by surfing the web and storing the files in
the respective directories of the dataset. The dataset
creation is motivated by the standard document dataset named
“20_newsgroups”, which is available at
http://qwone.com/~jason/20Newsgroups/. Like this dataset,
our self-created datasets also comprise 20 directories, where
each directory contains a different category of news files,
such as business, sports, science, entertainment, life and
style, politics, international affairs, tech and gadgets,
health, culture etc.
The news is collected for each segment from popular
newspapers such as The Hindu, Times of India, The
Telegraph, Hindustan Times, Deccan Chronicle, Deccan
Herald, Financial Express, Mumbai Mirror, The Economic
Times, The Guardian, The Press, The Statesman etc.

The sizes of the self-created datasets are 100, 250, 500,
750 and 1024 megabytes, and the Hadoop cluster sizes used
for experimentation are 3-node, 5-node, 8-node and 10-node
clusters. The variation in dataset sizes and cluster sizes
helps in evaluating the performance gain of our proposed
algorithm over traditional K-Means effectively. It also gives
us insight into the performance of the proposed algorithm on
different sized datasets with respect to different cluster
sizes. Table 1 provides the summary of execution time taken
while executing the proposed and traditional K-Means with
respect to different Hadoop clusters and datasets. This table
provides a framework to analyze and evaluate the performance
differences between the proposed and sequential K-Means.

An effective technique for comparing two quantities is
ratio calculation. A ratio depicts how many times one numeric
value is contained within another numeric value. To
critically evaluate the performance of the proposed algorithm
with different Hadoop cluster and dataset sizes, and to
compare it with traditional K-Means, the execution time for
the 100 MB dataset is taken as the unit of comparison and
the ratios of the execution times for the other datasets to
it are calculated. Table 1 shows this calculated ratio in the
row below each execution, for traditional K-Means on a single
standalone computer and for the proposed parallel K-Means on
different cluster sizes. It is clearly observed from the
table that the proposed parallel K-Means outperformed
traditional K-Means even on the 3-node cluster and the small
100 MB dataset. It is also observed that the bigger the
dataset becomes, the more efficient the proposed K-Means is
in terms of execution time. For example, on the 10-node
cluster the processing time for clustering the 100 MB dataset
is 52 s, whereas for the 1 GB dataset, which is 10 times
larger in size, it takes only 141 s, giving a ratio of
1:2.71.


Table 1 Execution time in seconds

K-means execution   Node(s)           100 MB   250 MB   500 MB   750 MB   1 GB
Sequential          Standalone        151      195      520      637      649
Ratio                                 1        1.3      3.4      4.2      4.3
Proposed            3-Node cluster    140      164      300      324      403
Ratio                                 1        1.17     2.14     2.3      2.9
Proposed            5-Node cluster    120      140      176      274      365
Ratio                                 1        1.16     1.46     2.83     3
Proposed            8-Node cluster    79       114      135      165      180
Ratio                                 1        1.44     1.7      2        2.27
Proposed            10-Node cluster   52       87       110      133      141
Ratio                                 1        1.67     2.11     2.55     2.71

Table 2 provides the performance ratio with respect to
the different execution strategies, i.e. traditional K-Means
and the proposed K-Means executed on different Hadoop
clusters. It can be observed that the bigger the Hadoop
cluster is, the better the performance it provides. For
example, for the 1 GB document dataset a 10-node cluster
performs the clustering job 4.6 times faster than
traditional standalone computing and 2.85 times faster than
the 3-node cluster.
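
The two kinds of ratio used in this section can be recomputed
directly from the execution times in Table 1. The sketch
below is only an arithmetic illustration; the names TIMES,
data_scaleup_ratios and speedup are hypothetical, and only
three of the five rows of Table 1 are copied in. Recomputed
values may differ from the printed ones in the last digit,
since the printed ratios appear to be truncated in places
rather than rounded.

```python
# Execution times in seconds copied from Table 1, ordered by dataset size:
# 100 MB, 250 MB, 500 MB, 750 MB, 1 GB.
TIMES = {
    "sequential": [151, 195, 520, 637, 649],
    "3-node": [140, 164, 300, 324, 403],
    "10-node": [52, 87, 110, 133, 141],
}


def data_scaleup_ratios(times):
    """Ratio of each execution time to the 100 MB time of the same setup."""
    base = times[0]
    return [t / base for t in times]


def speedup(slow, fast):
    """How many times faster the `fast` setup is than `slow`, per dataset size."""
    return [s / f for s, f in zip(slow, fast)]


# 1 GB versus 100 MB on the 10-node cluster (the 1:2.71 ratio in the text).
ratio_10_node = data_scaleup_ratios(TIMES["10-node"])[-1]

# 10-node cluster versus standalone and versus 3-node cluster, 1 GB dataset.
vs_sequential = speedup(TIMES["sequential"], TIMES["10-node"])[-1]
vs_3_node = speedup(TIMES["3-node"], TIMES["10-node"])[-1]
```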

Table 2 Performance ratio with respect to different execution strategies

Dataset size and ratio   K-means execution time in seconds
                         Standalone   3-Node cluster
100 MB                   151          140
Ratio                    2.9          2.7
250 MB                   195          164
Ratio                    2.24         1.88
500 MB                   520          300
Ratio                    4.7          2.12
750 MB                   637          324
Ratio                    4.8          2.43
1 GB                     649          403
Ratio                    4.6          2.85


Data scale-up is used in evaluating MapReduce processing
as it gives a clear picture of the efficiency gained using a
MapReduce based algorithm. Data scale-up shows the
change in the execution time taken by an algorithm for each
size of dataset used in the experiments. Figure 3 provides
an overall picture of the data scale-up obtained from the
experiments as a line graph. The horizontal axis of the line
graph represents the different data sizes (for example,
100 megabytes as 100 mb) used in the experiments, and the
vertical axis represents the execution time obtained from the
different executions of K-Means. The lines in the graph are
colored and each colored line represents a particular type of
K-Means execution used in the experiments. The meaning of
each colored line is given in the right hand corner of the
graph. For example, the top deep blue line represents the
sequential K-Means execution and the bottom sky blue line
represents the proposed K-Means execution on the 10-node
Hadoop cluster. The bar graph representation of Fig. 3 is
provided in Fig. 4.

Fig. 3 Line graph of data scale-up

Fig. 4 Bar graph of data scale-up

Cluster scale-up is also used in evaluating MapReduce
processing as it gives a clear picture of the efficiency
gained using a MapReduce based algorithm. Cluster scale-up
shows the change in the execution time taken by an algorithm
for the different sizes of Hadoop cluster used in the
experiments. Figure 5 provides an overall picture of the
cluster scale-up obtained from the experiments as a line
graph. The horizontal axis of the line graph represents
the number of computing nodes used in the experiments.
For example, the 1-node standalone traditional K-Means is
labelled “sequential”, and for the proposed algorithm the
number of nodes in the Hadoop cluster is used (such as
10-node for the proposed algorithm run on the 10-node Hadoop
cluster). The vertical axis, as in Figs. 3 and 4, represents
the execution time obtained from the different executions of
K-Means over the different data sizes. The lines in the graph
are colored and each colored line represents the experiments
over one data size. The meaning of each colored line is given
in the right hand corner of the graph. For example, the top
deep blue line represents the experiments over the 100
megabyte dataset and the bottom sky blue line represents the
experiments over the 1 gigabyte dataset. The bar graph
representation of Fig. 5 is provided in Fig. 6.


Conclusion and Future Work 


Modifying traditional clustering algorithms in order to
cluster large real-world datasets is a requirement of the
time. The MapReduce coding paradigm for the Hadoop platform
is an efficient option for modifying clustering algorithms.
This work proposed a MapReduce based modification of
K-Means for document clustering. The contribution of the
paper is that the MapReduce based K-Means is evaluated
intensively with 5 self-created document datasets of
different sizes on 4 Hadoop clusters of different sizes in
order to explore the performance of the proposed algorithm.
The proposed algorithm is also compared with traditional
K-Means. The results clearly show that the proposed algorithm
performs more efficiently than traditional K-Means for all
the dataset sizes tested. The analysis also shows that the
Hadoop cluster overhead makes the processing time uneven
across Hadoop cluster sizes. It is also determined that the
larger the dataset size and the Hadoop cluster size, the
better the performance the MapReduce based K-Means algorithm
provides.

As future work, the proposed algorithm can be
evaluated with a varying number of K-Means centroids and
varying input data split sizes. At the same time, an analysis
of output cluster quality can be performed in order to obtain
the optimal cluster number and the best split size for
achieving both efficiency and quality of clustering. Soft
computing techniques like fuzzy sets can also be used for
finding better quality clusters.